ABSTRACT

We explore the use of semantic word embeddings [14, 16, 12] in text
segmentation algorithms, including the C99 segmentation algorithm and
new algorithms inspired by the distributed word vector representation.
By developing a general framework for discussing a class of
segmentation objectives, we study the effectiveness of greedy versus
exact optimization approaches and suggest a new iterative refinement
technique for improving the performance of greedy strategies. We
compare our results to known benchmarks, using known metrics. We
demonstrate state-of-the-art performance for an untrained method with
our Content Vector Segmentation (CVS) on the Choi test set. Finally,
we apply the segmentation procedure to an in-the-wild dataset
consisting of text extracted from scholarly articles in the arXiv.org
database.


Categories and Subject Descriptors 

I.2.7 [Natural Language Processing]: Text Analysis


General Terms 

Information Retrieval, Clustering, Text 


Keywords 

Text Segmentation, Text Mining, Word Vectors 


1. INTRODUCTION 

Segmenting text into naturally coherent sections has many useful
applications in information retrieval and automated text
summarization, and has received much past attention. An early text
segmentation algorithm was the TextTiling method introduced by Hearst
in 1997. Text was scanned linearly, with a coherence calculated for
each adjacent block, and a heuristic was used to determine the
locations of cuts. In addition to linear approaches, there are text
segmentation algorithms that optimize some scoring objective. An
early algorithm in this class was Choi's C99 algorithm in 2000, which
also introduced a benchmark segmentation dataset used by subsequent
work. Instead of looking only at nearest neighbor coherence, the C99
algorithm computes a coherence score between all pairs of elements of
text^1 and searches for a text segmentation that optimizes an
objective based on that scoring by greedily making a succession of
best cuts. Later work by Choi and collaborators used distributed
representations of words rather than a bag of words approach, with
the representations generated by LSA. In 2001, Utiyama and Isahara
introduced a statistical model for segmentation and optimized a
posterior for the segment boundaries. Moving beyond the greedy
approaches, in 2004 Fragkou et al. [10] attempted to find the optimal
splitting for their own objective using dynamic programming. More
recent attempts at segmentation, including Misra et al. [15] and
Riedl and Biemann [18], used LDA based topic models to inform the
segmentation task. Du et al. consider structured topic models for
segmentation. Eisenstein and Barzilay and Dadachev et al. both
consider a Bayesian approach to text segmentation. Most similar to
our own work, Sakahara et al. consider a segmentation algorithm which
does affinity propagation clustering on text representations built
from word vectors learned from word2vec [14].

For the most part, the non-topic model based segmentation approaches
have been based on relatively simple representations of the
underlying text. Recent approaches to learning word vectors,
including Mikolov et al.'s word2vec [14], Pennington et al.'s GloVe
[16] and Levy and Goldberg's pointwise mutual information [12], have
seen remarkable success in solving analogy tasks, machine
translation, and sentiment analysis. These word vector approaches
attempt to learn a log-linear model for word-word co-occurrence
statistics, such that the probability of two words $(w, w')$
appearing near one another is proportional to the exponential of
their dot product,

$$ P(w \mid w') = \frac{\exp(w \cdot w')}{\sum_{v} \exp(v \cdot w')} . \qquad (1) $$


The method relies on these word-word co-occurrence statistics
encoding meaningful semantic and syntactic relationships. Arora et
al. have shown how the remarkable performance of these techniques can
be understood in terms of relatively mild assumptions about corpora
statistics, which in turn can be recreated with a simple generative
model.

^1 By 'elements', we mean the pieces of text combined in order to
comprise the segments. In the applications to be considered, the
basic elements will be either sentences or words.

Here we explore the utility of word vectors for text segmentation,
both in the context of existing algorithms such as C99, and when used
to construct new segmentation objectives based on a generative model
for segment formation. We will first construct a framework for
describing a family of segmentation algorithms, then discuss the
specific algorithms to be investigated in detail. We then apply our
modified algorithms both to the standard Choi test set and to a test
set generated from arXiv.org research articles.

2. TEXT SEGMENTATION 

The segmentation task is to split a text into contiguous coherent
sections. We first build a representation of the text, by splitting
it into $N$ basic elements, $V_i$ ($i = 1, \ldots, N$), each a
$D$-dimensional feature vector $V_{ia}$ ($a = 1, \ldots, D$)
representing the element. Then we assign a score $\sigma(i,j)$ to
each candidate segment, comprised of the $i$-th through $(j-1)$-th
elements, and finally determine how to split the text into the
appropriate number of segments.

Denote a segmentation of text into $K$ segments as a list of $K$
indices $s = (s_1, s_2, \ldots, s_K)$, where the $k$-th segment
includes the elements $V_i$ with $s_{k-1} \le i < s_k$, with
$s_0 \equiv 0$. For example, the string "aaabbcccdd" considered at
the character level would be properly split with $s = (3, 5, 8, 10)$
into ("aaa", "bb", "ccc", "dd").
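The boundary-list convention can be made concrete with a few lines of Python. This is only an illustrative sketch: the helper name `split_by_boundaries` is ours, and it assumes 0-based positions so that segment $k$ covers the half-open range from $s_{k-1}$ to $s_k$.

```python
def split_by_boundaries(elements, s):
    """Split `elements` into segments given boundaries s = (s_1, ..., s_K).

    Segment k holds the elements at 0-based positions s_{k-1} <= i < s_k,
    with s_0 = 0 (hypothetical helper, for illustration only).
    """
    segments = []
    start = 0  # s_0 = 0
    for end in s:
        segments.append(elements[start:end])
        start = end
    return segments


print(split_by_boundaries("aaabbcccdd", (3, 5, 8, 10)))
# ['aaa', 'bb', 'ccc', 'dd']
```

The last boundary always equals the number of elements, so a segmentation into $K$ segments has only $K - 1$ free cut positions.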

2.1 Representation 

The text representation thus amounts to turning a plain text document
$T$ into an $(N \times D)$-dimensional matrix $V$, with $N$ the
number of initial elements to be grouped into coherent segments and
$D$ the dimensionality of the element representation. For example, if
segmenting at the word level then $N$ would be the number of words in
the text, and each word might be represented by a $D$-dimensional
vector, such as those obtained from GloVe [16]. If segmenting instead
at the sentence level, then $N$ is the number of sentences in the
text and we must decide how to represent each sentence.

There are additional preprocessing decisions, for example using a
stemming algorithm or removing stop words before forming the
representation. Particular preprocessing decisions can have a large
effect on the performance of segmentation algorithms, but for
discussing scoring functions and splitting methods those decisions
can be abstracted into the specification of the $N \times D$ matrix
$V$.

2.2 Scoring 

Having built an initial representation of the text, we next specify
the coherence of a segment of text with a scoring function
$\sigma(i,j)$, which acts on the representation $V$ and returns a
score for the segment running from $i$ (inclusive) to $j$
(non-inclusive). The score can be a simple scalar or more general
object. In addition to the scoring function, we need to specify how
to return an aggregate score for the entire segmentation. This score
aggregation function $\oplus$ can be as simple as adding the scores
for the individual segments, or again some more general function. The
score $S(s)$ for an overall segmentation is given by aggregating the
scores of all of the segments in the segmentation:

$$ S(s) = \sigma(0, s_1) \oplus \sigma(s_1, s_2) \oplus \cdots \oplus \sigma(s_{K-1}, s_K) . \qquad (2) $$

Finally, to frame the segmentation problem as a form of optimization,
we need to map the aggregated score to a single scalar. The key
function $[\![\cdot]\!]$ returns this single number, so that the cost
for the above segmentation is

$$ C(s) = [\![ S(s) ]\!] . \qquad (3) $$

For most of the segmentation schemes to be considered, the score
function itself returns a scalar, so the score aggregation function
$\oplus$ will be taken as simple addition with the key function the
identity, but the generality here allows us to incorporate the C99
segmentation algorithm into the same framework.
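The three ingredients (segment score, aggregation, key) compose in a straightforward way. The following is a minimal sketch, not the authors' implementation: the function name `total_cost` and the toy score are assumptions made for illustration, with the defaults chosen as scalar addition and the identity key.

```python
from functools import reduce


def total_cost(score, boundaries, aggregate=lambda a, b: a + b, key=lambda x: x):
    """Cost = key(score(0, s_1) (+) score(s_1, s_2) (+) ... (+) score(s_{K-1}, s_K))."""
    starts = [0] + list(boundaries[:-1])
    pieces = [score(i, j) for i, j in zip(starts, boundaries)]
    return key(reduce(aggregate, pieces))


# Toy scalar score (hypothetical): squared deviation of a 1-D signal from
# its segment mean.
data = [1.0, 1.0, 1.0, 5.0, 5.0]


def toy_score(i, j):
    mean = sum(data[i:j]) / (j - i)
    return sum((x - mean) ** 2 for x in data[i:j])


print(total_cost(toy_score, (3, 5)))  # 0.0: both segments are constant
print(total_cost(toy_score, (2, 5)))  # larger: the second segment mixes values
```

Passing a different `aggregate` and `key` is what lets a pair-valued score be handled by the same code path.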

2.3 Splitting 

Having specified the representation of the text and scoring of the
candidate segments, we need to prescribe how to choose the final
segmentation. In this work, we consider three methods: (1) greedy
splitting, which at each step inserts the best available segmentation
boundary; (2) dynamic programming based segmentation, which uses
dynamic programming to find the optimal segmentation; and (3) an
iterative refinement scheme, which starts with the greedy
segmentation and then adjusts the boundaries to improve performance.

2.3.1 Greedy Segmentation 

The greedy segmentation approach builds up a segmentation into $K$
segments by greedily inserting new boundaries at each step to
minimize the aggregate score:

$$ s^0 = \{N\} \qquad (4) $$

$$ s^{t+1} = s^t \cup \{i^*\} , \quad i^* = \arg\min_{i \in [1, N)} C(s^t \cup \{i\}) \qquad (5) $$

until the desired number of splits is reached. Many published text
segmentation algorithms are greedy in nature, including the original
C99 algorithm.
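A compact sketch of the greedy rule follows. It is an illustration under stated assumptions rather than a reference implementation: `greedy_segmentation` and `toy_cost` are hypothetical names, and the cost is taken to be a scalar function of a sorted boundary tuple.

```python
def greedy_segmentation(cost, n, num_segments):
    """Insert the single best boundary at each step until K segments exist.

    `cost` maps a sorted tuple of boundaries (ending in n) to a scalar.
    """
    boundaries = [n]  # start from the trivial segmentation {N}
    while len(boundaries) < num_segments:
        candidates = [i for i in range(1, n) if i not in boundaries]
        best = min(candidates,
                   key=lambda i: cost(tuple(sorted(boundaries + [i]))))
        boundaries = sorted(boundaries + [best])
    return tuple(boundaries)


# Toy cost (hypothetical): summed squared deviations from each segment mean.
data = [0.0, 0.0, 0.0, 5.0, 5.0, 9.0, 9.0, 9.0, 9.0]


def toy_cost(boundaries):
    total, start = 0.0, 0
    for end in boundaries:
        seg = data[start:end]
        mean = sum(seg) / len(seg)
        total += sum((x - mean) ** 2 for x in seg)
        start = end
    return total


print(greedy_segmentation(toy_cost, len(data), 3))  # (3, 5, 9)
```

Each step re-evaluates the full cost for every candidate, which keeps the sketch short; a practical version would update only the segment being split.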

2.3.2 Dynamic Programming 

The greedy segmentation algorithm is not guaranteed to find the
optimal splitting, but dynamic programming methods can be used for
the text segmentation problem formulated in terms of optimizing a
scoring objective. For a detailed account of dynamic programming and
segmentation in general, see the thesis by Terzi. Dynamic programming
has been applied to text segmentation in Fragkou et al. [10], with
much success, but we will also consider here an optimization of the
C99 segmentation algorithm using a dynamic programming approach.

The goal of the dynamic programming approach is to split the
segmentation problem into a series of smaller segmentation problems,
by expressing the optimal segmentation of the first $n$ elements of
the sequence into $k$ segments in terms of the best choice for the
last segmentation boundary. The aggregated score $S(n, k)$ for this
optimal segmentation should be minimized with respect to the key
function $[\![\cdot]\!]$:

$$ S(n, 1) = \sigma(0, n) \qquad (6) $$

$$ S(n, k) = \min_{l < n}^{[\![\cdot]\!]} \; S(l, k-1) \oplus \sigma(l, n) . \qquad (7) $$

While the dynamic programming approach yields the optimal
segmentation for our decomposable score function, it can be costly to
compute, especially for long texts. In practice, both the optimal
segmentation score and the resulting segmentation can be found in one
pass by building up a table of segmentation scores and optimal cut
indices one row at a time.
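The table-building procedure can be sketched as below. The sketch assumes the simplest case described in the text (scalar scores, addition as aggregation, identity key); `dp_segmentation` and `toy_score` are illustrative names, not taken from the released code.

```python
def dp_segmentation(score, n, num_segments):
    """Optimal split of n elements into num_segments contiguous segments.

    Assumes scalar, additive segment scores and n >= num_segments.
    Returns (optimal cost, boundary tuple).
    """
    inf = float("inf")
    # best[k][m]: minimal cost of splitting the first m elements into k segments
    best = [[inf] * (n + 1) for _ in range(num_segments + 1)]
    cut = [[0] * (n + 1) for _ in range(num_segments + 1)]
    for m in range(1, n + 1):
        best[1][m] = score(0, m)
    for k in range(2, num_segments + 1):
        for m in range(k, n + 1):
            for l in range(k - 1, m):  # position of the last boundary
                c = best[k - 1][l] + score(l, m)
                if c < best[k][m]:
                    best[k][m], cut[k][m] = c, l
    # Walk back through the stored cut indices to recover the boundaries.
    boundaries, m = [], n
    for k in range(num_segments, 0, -1):
        boundaries.append(m)
        m = cut[k][m]
    return best[num_segments][n], tuple(reversed(boundaries))


data = [0.0, 0.0, 0.0, 5.0, 5.0, 9.0, 9.0, 9.0, 9.0]


def toy_score(i, j):
    mean = sum(data[i:j]) / (j - i)
    return sum((x - mean) ** 2 for x in data[i:j])


print(dp_segmentation(toy_score, len(data), 3))  # (0.0, (3, 5, 9))
```

The triple loop makes the cost of this sketch grow as $K N^2$ score evaluations, which is the expense the text refers to for long documents.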

2.3.3 Iterative Relaxation 
Inspired by the popular Lloyd algorithm for $k$-means, we attempt to
retain the computational benefit of the greedy segmentation approach,
but realize additional performance gains by iteratively refining the
segmentation. Since text segmentation problems require contiguous
blocks of text, a natural scheme for relaxation is to try to move
each segment boundary optimally while keeping the edges to either
side of it fixed:

$$ s_k^{t+1} = \arg\min_{l \in (s_{k-1}^t, s_{k+1}^t)} [\![ \sigma(0, s_1^t) \oplus \cdots \oplus \sigma(s_{k-1}^t, l) \oplus \sigma(l, s_{k+1}^t) \oplus \cdots \oplus \sigma(s_{K-1}^t, s_K^t) ]\!] \qquad (8) $$

$$ = \arg\min_{l \in (s_{k-1}^t, s_{k+1}^t)} C\big( s^t - \{s_k^t\} \cup \{l\} \big) \qquad (9) $$

We will see in practice that by 20 iterations it has typically
converged to a fixed point very close to the optimal dynamic
programming segmentation.
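One way to realize this refinement is sketched below. The function name, the strict-improvement rule for moving a boundary, and the toy cost are our assumptions; the cap of 20 sweeps mirrors the iteration count quoted in the text.

```python
def refine_segmentation(cost, boundaries, max_iters=20):
    """Move each interior boundary to its best spot between its neighbours."""
    s = list(boundaries)
    for _ in range(max_iters):
        changed = False
        for k in range(len(s) - 1):  # the final boundary (= N) stays fixed
            lo = s[k - 1] if k > 0 else 0
            hi = s[k + 1]

            def cost_at(l):
                return cost(tuple(s[:k] + [l] + s[k + 1:]))

            best = min(range(lo + 1, hi), key=cost_at)
            if cost_at(best) < cost_at(s[k]):  # move only on strict improvement
                s[k], changed = best, True
        if not changed:
            break  # fixed point reached
    return tuple(s)


data = [0.0, 0.0, 0.0, 5.0, 5.0, 9.0, 9.0, 9.0, 9.0]


def toy_cost(boundaries):
    total, start = 0.0, 0
    for end in boundaries:
        seg = data[start:end]
        mean = sum(seg) / len(seg)
        total += sum((x - mean) ** 2 for x in seg)
        start = end
    return total


print(refine_segmentation(toy_cost, (2, 6, 9)))  # (3, 5, 9)
```

Because each move only looks between the two neighbouring boundaries, a sweep costs far fewer score evaluations than the full dynamic program.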


3. SCORING FUNCTIONS 

In the experiments to follow, we will test various choices 
for the representation, scoring function, and splitting method 
in the above general framework. The segmentation algo¬ 
rithms to be considered fall into three groups: 


3.1 C99 Segmentation 

Choi's C99 algorithm was an early text segmentation algorithm with
promising results. The feature vector for an element of text is
chosen as the pairwise cosine distances with other elements of text,
where those elements in turn are represented by a bag of stemmed
words vector (after preprocessing to remove stop words):

$$ A_{ij} = \frac{\sum_w f_{i,w} f_{j,w}}{\sqrt{\sum_w f_{i,w}^2 \, \sum_w f_{j,w}^2}} , \qquad (10) $$

with $f_{i,w}$ the frequency of word $w$ in element $i$. The pairwise
cosine distance matrix is noisy for these features, and since only
the relative values are meaningful, C99 employs a ranking
transformation, replacing each value of the matrix by the fraction of
its neighbors with smaller value:

$$ V_{ij} = \frac{1}{r^2} \sum_{i - r/2 \le l \le i + r/2} \; \sum_{j - r/2 \le m \le j + r/2} [ A_{ij} > A_{lm} ] , \qquad (11) $$

where the neighborhood is an $r \times r$ block around the entry, the
square brackets mean 1 if the inequality is satisfied otherwise 0
(and values off the end of the matrix are not counted in the sum, or
towards the normalization). Each element of the text in the C99
algorithm is represented by a rank transformed vector of its cosine
distances to each other element.
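The cosine matrix and the rank transformation can be sketched in plain Python. This is a hedged illustration, not Choi's code: the function names are ours, `r` is assumed odd, the window is clipped at the matrix edges, and the default `r=3` is chosen only to keep the toy example small.

```python
import math


def cosine_matrix(counts):
    """counts: one dict per element, mapping word -> frequency."""
    norms = [math.sqrt(sum(f * f for f in c.values())) for c in counts]
    n = len(counts)
    A = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            dot = sum(f * counts[j].get(w, 0) for w, f in counts[i].items())
            if norms[i] and norms[j]:
                A[i][j] = dot / (norms[i] * norms[j])
    return A


def rank_transform(A, r=3):
    """Replace A[i][j] by the fraction of in-bounds window cells that are smaller."""
    n, h = len(A), r // 2
    V = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            smaller = total = 0
            for l in range(max(0, i - h), min(n, i + h + 1)):
                for m in range(max(0, j - h), min(n, j + h + 1)):
                    total += 1  # cells off the matrix are never visited
                    smaller += A[i][j] > A[l][m]
            V[i][j] = smaller / total
    return V


counts = [{"a": 1}, {"a": 1, "b": 1}, {"b": 1}]
A = cosine_matrix(counts)
V = rank_transform(A, r=3)
print(V[1][1])  # 6 of the 9 window cells are smaller than the centre
```

The rank transform discards the absolute similarity values and keeps only their local ordering, which is the point made in the text about noisy bag-of-words similarities.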

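The score function describes the average intersentence similarity by
taking the overall score to be

$$ C(s) = \frac{\sum_k \beta_k}{\sum_k \alpha_k} , \qquad (12) $$

where $\beta_k = \sum_{s_k \le i < s_{k+1}} \sum_{s_k \le j < s_{k+1}} V_{ij}$
is the sum of all ranked cosine similarities in a segment and
$\alpha_k = (s_{k+1} - s_k)^2$ is the squared length of the segment.
This score function is still decomposable, but requires that we
define the local score function to return a pair,

$$ \sigma(i,j) = \Big( \sum_{i \le k < j} \sum_{i \le l < j} V_{kl} , \; (j - i)^2 \Big) , \qquad (13) $$

with score aggregation function defined as component addition,

$$ (\beta_1, \alpha_1) \oplus (\beta_2, \alpha_2) = (\beta_1 + \beta_2, \; \alpha_1 + \alpha_2) , \qquad (14) $$

and key function defined as division of the two components,

$$ [\![ (\beta, \alpha) ]\!] = \frac{\beta}{\alpha} . \qquad (15) $$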
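The pair-valued score, its aggregation, and the key function fit together as in the following sketch. The names `c99_score`, `aggregate`, `key`, and `c99_density` are illustrative assumptions, and the small block matrix stands in for a rank-transformed similarity matrix.

```python
def c99_score(V, i, j):
    """Pair (beta, alpha): summed similarities and squared length of [i, j)."""
    beta = sum(V[k][l] for k in range(i, j) for l in range(i, j))
    return (beta, (j - i) ** 2)


def aggregate(p, q):
    """Component-wise addition of two (beta, alpha) pairs."""
    return (p[0] + q[0], p[1] + q[1])


def key(pair):
    """Ratio of total similarity to total squared segment length."""
    beta, alpha = pair
    return beta / alpha


def c99_density(V, boundaries):
    total, start = (0.0, 0), 0
    for end in boundaries:
        total = aggregate(total, c99_score(V, start, end))
        start = end
    return key(total)


# Toy block-diagonal similarity matrix (hypothetical values).
V = [[1, 1, 0, 0],
     [1, 1, 0, 0],
     [0, 0, 1, 1],
     [0, 0, 1, 1]]
print(c99_density(V, (2, 4)))  # 1.0: the cut follows the block structure
print(c99_density(V, (1, 4)))  # 0.6: a misplaced cut lowers the density
```

Returning a pair rather than a ratio per segment is what keeps the objective decomposable: the ratio of sums is not a sum of ratios, but the pairs do add.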


While earlier work with the C99 algorithm considered only a greedy
splitting approach, in the experiments that follow we will use our
more general framework to explore both optimal dynamic programming
and refined iterative versions of C99. Followup work by Choi et al.
explored the effect of using combinations of LSA word vectors in
eq. (10) in place of the $f_{i,w}$. Below we will explore the effect
of using combinations of word vectors to represent the elements.


3.2 Average word vector 

To assess the utility of word vectors in segmentation, we first
investigate how they can be used to improve the C99 algorithm, and
then consider more general scoring functions based on our word vector
representation. As the representation of an element, we take

$$ V_{ik} = \sum_w f_{i,w} \, v_{wk} , \qquad (16) $$

with $f_{i,w}$ representing the frequency of word $w$ in element $i$,
and $v_{wk}$ representing the $k$-th component of the word vector for
word $w$ as learned by a word vector training algorithm, such as
word2vec [14] or GloVe [16].

The length of word vectors varies strongly across the vocabulary and
in general correlates with word frequency. In order to mitigate the
effect of common words, we will sometimes weight the sum by the
inverse document frequency (idf) of the word in the corpus:

$$ V_{ik} = \sum_w f_{i,w} \, \log\frac{|D|}{df_w} \, v_{wk} , \qquad (17) $$

where $df_w$ is the number of documents in which word $w$ appears and
$|D|$ is the total number of documents in the corpus. We can instead
normalize the word vectors before adding them together,

$$ V_{ik} = \sum_w f_{i,w} \, \hat{v}_{wk} , \qquad \hat{v}_{wk} = \frac{v_{wk}}{\sqrt{\sum_k v_{wk}^2}} , \qquad (18) $$

or both weight by idf and normalize.
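The three element representations differ only in how each word vector is scaled before summing, as the sketch below shows. The helper `element_vector`, its parameters, and the choice to skip out-of-vocabulary words are assumptions for illustration; the toy vectors are made up.

```python
import math
from collections import Counter


def element_vector(words, vectors, df=None, num_docs=None, normalize=False):
    """Sum the word vectors of one element: tf, optionally idf-weighted/normalized.

    vectors: dict word -> list of floats; df: dict word -> document frequency.
    """
    dim = len(next(iter(vectors.values())))
    out = [0.0] * dim
    for w, f in Counter(words).items():
        if w not in vectors:
            continue  # assumption: out-of-vocabulary words are skipped
        v = vectors[w]
        if normalize:
            norm = math.sqrt(sum(x * x for x in v))
            v = [x / norm for x in v] if norm else v
        weight = f * (math.log(num_docs / df[w]) if df is not None else 1.0)
        out = [o + weight * x for o, x in zip(out, v)]
    return out


vectors = {"cat": [3.0, 4.0], "the": [1.0, 0.0]}  # toy 2-d vectors
words = ["the", "cat", "the"]
print(element_vector(words, vectors))                  # plain tf sum
print(element_vector(words, vectors, normalize=True))  # unit-length word vectors
print(element_vector(words, vectors, df={"cat": 1, "the": 4}, num_docs=4))
```

In the last call the word "the" appears in every document, so its idf weight is zero and only "cat" contributes.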

Segmentation is a form of clustering, so a natural choice for scoring
function is the sum of square deviations from the mean of the
segment, as used in $k$-means:

$$ \sigma(i,j) = \sum_{l=i}^{j-1} \sum_k \big( V_{lk} - \mu_k(i,j) \big)^2 , \qquad (19) $$

$$ \text{where} \quad \mu_k(i,j) = \frac{1}{j - i} \sum_{l=i}^{j-1} V_{lk} , \qquad (20) $$

and which we call the Euclidean score function. Generally, however,
cosine similarity is used for word vectors, making angles between
words more important than distances. In some experiments, we
therefore normalize the word vectors first, so that a Euclidean
distance score better approximates the cosine distance (recall
$|v - w|^2 = |v|^2 + |w|^2 - 2 v \cdot w = 2(1 - v \cdot w)$ for
normalized vectors).
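A direct transcription of the Euclidean score into Python might look like the following; `euclidean_score` is an illustrative name and the three toy rows are made up.

```python
def euclidean_score(V, i, j):
    """Sum of squared deviations of rows V[i:j] from their mean vector."""
    rows = V[i:j]
    dim = len(rows[0])
    mean = [sum(row[k] for row in rows) / len(rows) for k in range(dim)]
    return sum((row[k] - mean[k]) ** 2 for row in rows for k in range(dim))


V = [[0.0, 0.0], [2.0, 0.0], [10.0, 10.0]]
print(euclidean_score(V, 0, 2))  # 2.0: two nearby rows
print(euclidean_score(V, 2, 3))  # 0.0: a single row equals its own mean
print(euclidean_score(V, 0, 3))  # much larger once the distant row is included
```

Because the score is a scalar that adds across segments, it plugs into the greedy, dynamic programming, and refinement splitters without a custom aggregation or key function.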

3.3 Content Vector Segmentation (CVS) 

Trained word vectors have a remarkable amount of structure. Analogy
tasks such as man:woman::king:? can be solved by finding the vector
closest to the linear query:

$$ v_{\mathrm{woman}} - v_{\mathrm{man}} + v_{\mathrm{king}} . \qquad (21) $$

Arora et al. constructed a generative model of text that explains how
this linear structure arises and can be maintained even in relatively
low dimensional vector models. The generative model consists of a
content vector which undergoes a random walk from a stationary
distribution defined to be the product distribution on each of its
components $c_k$, uniform on the interval
$[-\frac{1}{\sqrt{D}}, \frac{1}{\sqrt{D}}]$ (with $D$ the
dimensionality of the word vectors). At each point in time, a word
vector is generated by the content vector according to a log-linear
model:

$$ P(w \mid c) = \frac{1}{Z_c} \exp(w \cdot c) , \qquad Z_c = \sum_v \exp(v \cdot c) . \qquad (22) $$


The slow drift of the content vectors helps to ensure that nearby
words obey with high probability a log-linear model for their
co-occurrence probability:

$$ \log P(w, w') = \frac{1}{2D} \| v_w + v_{w'} \|^2 - 2 \log Z \pm o(1) , \qquad (23) $$

for some fixed $Z$.

To segment text into coherent sections, we will boldly assume that
the content vector in each putative segment is constant, and measure
the log likelihood that all words in the segment are drawn from the
same content vector $c$. (This is similar in spirit to the
probabilistic segmentation technique proposed by Utiyama and
Isahara.) Assuming the word draws $\{w_i\}$ are independent, we have
that the log likelihood

$$ \log P(\{w_i\} \mid c) = \sum_i \log P(w_i \mid c) \propto \sum_i w_i \cdot c \qquad (24) $$

is proportional to the sum of the dot products of the word vectors
$w_i$ with the content vector $c$. We use a maximum likelihood
estimate for the content vector:

$$ c = \arg\max_c \log P(c \mid \{w_i\}) \qquad (25) $$

$$ = \arg\max_c \big( \log P(\{w_i\} \mid c) + \log P(c) - \log P(\{w_i\}) \big) \qquad (26) $$

$$ \propto \arg\max_c \sum_i w_i \cdot c \quad \text{s.t.} \quad -\frac{1}{\sqrt{D}} < c_k < \frac{1}{\sqrt{D}} . \qquad (27) $$

This determines what we will call the Content Vector Segmentation
(CVS) algorithm, based on the score function

$$ \sigma(i,j) = \sum_{i \le l < j} \sum_k w_{lk} \, c_k(i,j) . \qquad (28) $$

The score $\sigma(i,j)$ for a segment $(i,j)$ is the sum of the dot
products of the word vectors $w_{lk}$ with the maximum likelihood
content vector $c(i,j)$ for the segment, with components given by

$$ c_k(i,j) = \frac{1}{\sqrt{D}} \, \mathrm{sign}\Big( \sum_{i \le l < j} w_{lk} \Big) . \qquad (29) $$

The maximum likelihood content vector thus has components
$\pm 1/\sqrt{D}$, depending on whether the sum of the word vector
components in the segment is positive or negative.

This score function will turn out to generate some of the most
accurate segmentation results. Note that CVS is completely untrained
with respect to the specific text to be segmented, relying only on a
suitable set of word vectors, derived from some corpus in the
language of choice. While CVS is most justifiable when working on the
word vectors directly, we will also explore the effect of normalizing
the word vectors before applying the objective.

4. EXPERIMENTS

To explore the efficacy of different segmentation strategies and
algorithms, we performed segmentation experiments on two datasets.
The first is the Choi dataset, a common benchmark used in earlier
segmentation work, and the second is a similarly constructed dataset
based on articles uploaded to the arXiv, as will be described in
Section 4.3. All code and data used for these experiments is
available online.^2


4.1 Evaluation 

To evaluate the performance of our algorithms, we use two standard
metrics: the $P_k$ metric and the WindowDiff (WD) metric. For text
segmentation, near misses should get more credit than far misses. The
$P_k$ metric captures the probability for a probe composed of a pair
of nearby elements (at constant distance positions $(i, i+k)$) to be
placed in the same segment by both reference and hypothesized
segmentations. In particular, the $P_k$ metric counts the number of
disagreements on the probe elements:

$$ P_k = \frac{1}{N - k} \sum_{i=1}^{N-k} \big[ \delta_{\mathrm{hyp}}(i, i+k) \ne \delta_{\mathrm{ref}}(i, i+k) \big] , \qquad (30) $$

$$ k = \mathrm{nearest\ integer}\Big( \frac{1}{2} \, \frac{\#\,\mathrm{elements}}{\#\,\mathrm{segments}} \Big) - 1 , $$

where $\delta(i,j)$ is equal to 1 or 0 according to whether or not
both element $i$ and $j$ are in the same segment in hypothesized and
reference segmentations, resp., and the argument of the sum tests
agreement of the hypothesis and reference segmentations. ($k$ is
taken to be one less than the integer closest to half of the number
of elements divided by the number of segments in the reference
segmentation.) The total is then divided by the total number of
probes. This metric counts the number of disagreements, so lower
scores indicate better agreement between the two segmentations.
Trivial strategies such as choosing only a single segment, or giving
each element its own segment, or giving constant boundaries or random
boundaries, tend to produce values of around 50%.
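The probe-counting definition translates into a short function. The sketch below makes several assumptions that are not in the text: segmentations are given as boundary tuples over 0-based positions, `k` is clamped to at least 1 so that very short references still yield probes, and Python's `round` (which rounds exact halves to the even integer) stands in for "nearest integer".

```python
def same_segment(boundaries, i, j):
    """True if 0-based positions i and j fall in the same segment."""
    def seg(x):
        return next(k for k, b in enumerate(boundaries) if x < b)
    return seg(i) == seg(j)


def pk(ref, hyp, k=None):
    """Fraction of probes (i, i + k) on which ref and hyp disagree."""
    n = ref[-1]
    if k is None:
        # one less than the rounded half of the mean reference segment length
        k = max(1, round(0.5 * n / len(ref)) - 1)
    probes = n - k
    errors = sum(same_segment(ref, i, i + k) != same_segment(hyp, i, i + k)
                 for i in range(probes))
    return errors / probes


ref = (3, 5, 8, 10)
print(pk(ref, ref))     # 0.0: identical segmentations
print(pk(ref, (10,)))   # a single segment misses every reference boundary
```

In the second call three of the nine probes straddle a reference boundary, so the score is 3/9.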

^2 github.com/alexalemi/segmentation

The $P_k$ metric has the disadvantage that it penalizes false
positives more severely than false negatives, and can suffer when the
distribution of segment sizes varies. Pevzner and Hearst introduced
the WindowDiff (WD) metric:

$$ WD = \frac{1}{N - k} \sum_{i=1}^{N-k} \big[ b_{\mathrm{ref}}(i, i+k) \ne b_{\mathrm{hyp}}(i, i+k) \big] , \qquad (31) $$

where $b(i,j)$ counts the number of boundaries between location $i$
and $j$ in the text, and an error is registered if the hypothesis and
reference segmentations disagree on the number of boundaries. In
practice, the $P_k$ and WD scores are highly correlated, with $P_k$
more prevalent in the literature; we will provide both for most of
the experiments here.
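WindowDiff differs from $P_k$ only in what is compared inside each window. A minimal sketch, assuming boundary tuples over 0-based positions and an explicitly supplied window size `k` (the function names are illustrative):

```python
def boundaries_between(boundaries, i, j):
    """Number of interior segment boundaries b with i < b <= j."""
    return sum(1 for b in boundaries[:-1] if i < b <= j)


def window_diff(ref, hyp, k):
    """Fraction of windows (i, i + k) whose boundary counts disagree."""
    n = ref[-1]
    probes = n - k
    errors = sum(
        boundaries_between(ref, i, i + k) != boundaries_between(hyp, i, i + k)
        for i in range(probes))
    return errors / probes


ref = (3, 5, 8, 10)
print(window_diff(ref, ref, k=2))             # 0.0
print(window_diff(ref, (4, 5, 8, 10), k=2))   # 0.25: one boundary off by one
```

In the second call the shifted boundary changes the count in two of the eight windows, which illustrates how a near miss is penalized less than a missing boundary would be.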

4.2 Choi Dataset 

The Choi dataset is used to test whether a segmentation algorithm can
distinguish natural topic boundaries. It concatenates the first $n$
sentences from ten different documents chosen at random from a 124
document subset of the Brown corpus (the ca**.pos and cj**.pos sets).
The number of sentences $n$ taken from each document is chosen
uniformly at random within a range specified by the subset id (i.e.,
as min-max #sentences). There are four ranges considered: (3-5, 6-8,
9-11, 3-11), the first three of which have 100 example documents, and
the last 400 documents. The dataset can be obtained from an archived
version of the C99 segmentation code release.^3 An extract from one
of the documents in the test set is shown in Fig. 1.

4.2.1 C99 benchmark 

We will explore the effect of changing the representation and
splitting strategy of the C99 algorithm. In order to give fair
comparisons we implemented our own version of the C99 algorithm
(oC99). The C99 performance depended sensitively on the details of
the text preprocessing. Details can be found in Appendix A.

4.2.2 Effect of word vectors on C99 variant 

The first experiment explores the ability of word vectors to improve
the performance of the C99 algorithm. The word vectors were learned
by GloVe on a 42 billion word set of the Common Crawl corpus in 300
dimensions.^4 We emphasize that these word vectors were not trained
on the Brown or Choi datasets directly, and instead come from a
general corpus of English. These vectors were chosen in order to
isolate any improvement due to the word vectors from any confounding
effects due to details of the training procedure. The results are
summarized in Table 1 below. The upper section cites earlier results
exploring the utility of using LSA word vectors, which showed an
improvement of a few percent over their baseline C99 implementation.
The middle section shows results from [18], which augmented the C99
method by representing each element with a histogram of topics
learned from LDA. Our results are in the lower section, showing how
word vectors improve the performance of the algorithm.

^3 http://web.archive.org/web/20010422042459/http://www.cs.man.ac.uk/~choif/software/C99-1.2-release.tgz
(We thank Martin Riedl for pointing us to the dataset.)

^4 Obtainable from http://www-nlp.stanford.edu/data/glove.42B.300d.txt.gz


Some of the features of the top portions of Figure 1
    and Figure 2 were mentioned in discussing Table 1
First , the Onset Profile spreads across
    approximately 12 years for boys and 10 years for
    girls .
In contrast , 20 of the 21 lines in the Completion
    Profile ( excluding center 5 for boys and 4 for
    girls ) are bunched and extend over a much
    shorter period , approximately 30 months for boys
    and 40 months for girls .
The Maturity Chart for each sex demonstrates clearly
    that Onset is a phenomenon of infancy and early
    childhood whereas Completion is a phenomenon of
    the later portion of adolescence .


The many linguistic techniques for reducing the
    amount of dictionary information that have been
    proposed all organize the dictionary 's contents
    around prefixes , stems , suffixes , etc .
A significant reduction in the voume of store
    information is thus realized , especially for a
    highly inflected language such as Russian .
For English the reduction in size is less striking .
This approach requires that : ( 1 ) each text word be
    separated into smaller elements to establish a
    correspondence between the occurrence and
    dictionary entries , and ( 2 ) the information
    retrieved from several entries in the dictionary
    be synthesized into a description of the
    particular word .


Figure 1: Example of two segments from the Choi dataset, taken from
an entry in the 3-5 set. Note the appearance of a "sentence" with the
single character in the second segment on line 8. These short
sentences can confound the benchmarks.


In each of these last experiments, we turned off the rank
transformation, pruned the stop words and punctuation, but did not
stem the vocabulary. Word vectors can be incorporated in a few
natural ways. Vectors for each word in a sentence can simply be
summed, giving results shown in the oC99tf row. But all words are not
created equal, so the sentence representation might be dominated by
the vectors for common words. In the oC99tfidf row, the word vectors
are weighted by $\mathrm{idf}_w = \log\frac{500}{df_w}$ (i.e., the
log of the inverse document frequency of each word in the Brown
corpus, which has 500 documents in total) before summation. We see
some improvement from using word vectors, for example the $P_k$ of
14.78% for the oC99tfidf method on the 3-11 set, compared to $P_k$ of
15.56% for our baseline C99 implementation. On the shorter 3-5 test
set, our oC99tfidf method achieves $P_k$ of 10.27% versus the
baseline oC99 $P_k$ of 14.22%. To compare to the various topic model
based approaches, e.g. [18], we perform spherical $k$-means
clustering on the word vectors and represent each sentence as a
histogram of its word clusters (i.e., as a vector in the space of
clusters, with components equal to the number of its words in that
cluster). In this case, the word topic representations (oC99k50 and
oC99k200 in Table 1) do not perform as well as the topic model based
C99 variants. But as has been noted, those topic models were trained
on cross-validated subsets of the Choi dataset, and benefited from
seeing virtually all of the sentences in the test sets already in
each training set, so have an unfair advantage that would not
necessarily carry over to real world applications. Overall, the
results in Table 1 illustrate that the word vectors obtained from
GloVe can markedly improve existing segmentation algorithms.
                          P_k                            WD
Algorithm     3-5    6-8    9-11   3-11      3-5    6-8    9-11   3-11
------------------------------------------------------------------------
C99           12     11     9      9
C99LSA        9      10     7      5
------------------------------------------------------------------------
C99 [18]                           11.20                          12.07
C99LDA                             4.16                           4.89
------------------------------------------------------------------------
oC99          14.22  12.20  11.59  15.56     14.22  12.22  11.60  15.64
oC99tf        12.14  13.17  14.60  14.91     12.14  13.34  15.22  15.22
oC99tfidf     10.27  12.23  15.87  14.78     10.27  12.30  16.29  14.96
oC99k50       20.39  21.13  23.76  24.33     20.39  21.34  23.26  24.63
oC99k200      18.60  17.37  19.42  20.85     18.60  17.42  19.60  20.97

Table 1: Effect of using word vectors in the C99 text segmentation
algorithm. $P_k$ and WD results are shown (smaller values indicate
better performance). The top section (C99 vs. C99LSA) shows the few
percent improvement over the C99 baseline reported for using LSA to
encode the words. The middle section (C99 vs. C99LDA) shows the
effect of modifying the C99 algorithm to work on histograms of LDA
topics in each sentence, from [18]. The bottom section shows the
effect of using word vectors trained from GloVe [16] in our oC99
implementation of the C99 segmentation algorithm. The oC99tf
implementation sums the word vectors in each sentence, with no rank
transformation, after removing stop words and punctuation. oC99tfidf
weights the sum by the log of the inverse document frequency of each
word. The oC99k models use the word vectors to form a topic model by
doing spherical $k$-means on the word vectors. oC99k50 uses 50
clusters and oC99k200 uses 200.


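Algorithm        rep     n    P_k     WD
-------------------------------------------
oC99             tf      -    11.78   11.94
                 tfidf   -    12.19   12.27
-------------------------------------------
Euclidean        tf      F    7.68    8.28
                 tf      T    9.18    10.83
                 tfidf   F    12.89   14.27
                 tfidf   T    8.32    8.95
-------------------------------------------
Content (CVS)    tf      F    5.29    5.39
                 tf      T    5.42    5.55
                 tfidf   F    5.75    5.87
                 tfidf   T    5.03    5.12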


Table 2: Results obtained by varying the scoring function. These runs
were on the 3-11 set from the Choi database, with a word cut of 5
applied, after preprocessing to remove stop words and punctuation,
but without stemming. The CVS method does remarkably better than
either the C99 method or a Euclidean distance-based scoring function.


4.2.3 Alternative Scoring frameworks 

The use of word vectors permits consideration of natural scoring
functions other than C99-style segmentation scoring. The second
experiment examines alternative scoring frameworks using the same
GloVe word vectors as in the previous experiment. To test the utility
of the scoring functions more directly, for these experiments we used
the optimal dynamic programming segmentation. Results are summarized
in Table 2, which shows the average $P_k$ and WD scores on the 3-11
subset of the Choi dataset. In all cases, we removed stop words and
punctuation, did not stem, but after preprocessing removed sentences
with fewer than 5 words.

Note first that the dynamic programming results for our
implementation of C99 with tf weights give $P_k = 11.78\%$, 3% better
than the greedy version result of 14.91% reported in Table 1. This
demonstrates that the original C99 algorithm and its applications can
benefit from a more exact minimization than given by the greedy
approach. We considered two natural score functions: the Euclidean
scoring function (eqn. (20)) which minimizes the sum of the square
deviations of each vector in a segment from the average vector of the
segment, and the Content Vector scoring (CVS) (eqn. (28) of section
3.3), which uses an approximate log posterior for the words in the
segment, as determined from its maximum likelihood content vector. In
each case, we consider vectors for each sentence generated both as a
strict sum of the words comprising it (tf approach), and as a sum
weighted by the log idf (tfidf approach, as in sec. 4.2.2).
Additionally, we consider the effect of normalizing the element
vectors before starting the score minimization, as indicated by the n
column.
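For concreteness, the CVS score compared here (the dot products of a segment's vectors with its maximum likelihood content vector) can be sketched as follows. The function name and toy rows are illustrative assumptions. The value returned is larger for more coherent segments, so a splitter written as a minimizer would be given its negative; that sign convention is our assumption, not something stated in the text.

```python
import math


def cvs_score(W, i, j):
    """CVS score of rows W[i:j], one word (or element) vector per row."""
    dim = len(W[0])
    sums = [sum(W[l][k] for l in range(i, j)) for k in range(dim)]
    # Maximum likelihood content vector: +-1/sqrt(D) per component,
    # with the sign of the summed component (0 if the sum vanishes).
    c = [math.copysign(1.0, s) / math.sqrt(dim) if s else 0.0 for s in sums]
    return sum(s * ck for s, ck in zip(sums, c))


W = [[1.0, 2.0], [3.0, -1.0], [-4.0, 0.0]]
print(cvs_score(W, 0, 2))  # (|4| + |1|) / sqrt(2)
print(cvs_score(W, 0, 3))  # first components cancel, leaving |1| / sqrt(2)
```

Equivalently, the score is the sum of the absolute values of the summed components divided by $\sqrt{D}$, so vectors that point in conflicting directions within a segment cancel and lower the score.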

The CVS score function eqn. (28) performs the best overall, with
$P_k$ scores below 6%, indicating an improved segmentation
performance using a score function adapted to the choice of
representation. While the most principled score function would be the
Content score function using tf weighted element vectors without
normalization, the normalized tfidf scheme actually performs the
best. This is probably due to the uncharacteristically large effect
common words have on the element representation, which the log idf
weights and the normalization help to mitigate.

Strictly speaking, the idf weighted schemes cannot claim to be
completely untrained, as they benefit from word usage statistics in
the Choi test set, but the raw CVS method still demonstrates a marked
improvement on the 3-11 subset, 5.29% $P_k$ versus the optimal C99
baseline of 11.78% $P_k$.


4.2.4 Effect of Splitting Strategy 

To explore the effect of the splitting strategy and to compare
our overall results on the Choi test set against
other published benchmarks, in our third experiment we ran
the raw CVS method against all of the Choi test subsets, using
all three splitting strategies discussed: greedy, refined,
and dynamic programming. These results are summarized
in Table 3.
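
To make the contrast between greedy and exact splitting concrete, here is a minimal sketch for any additive segment cost. It uses the Euclidean cost purely for illustration; the function names and the top-down form of the greedy strategy are assumptions of the example rather than a description of the code behind the reported numbers.

```python
import numpy as np

def seg_cost(x, i, j):
    """Euclidean cost of segment x[i:j]: squared deviations from its mean."""
    seg = x[i:j]
    return float(((seg - seg.mean(axis=0)) ** 2).sum())

def total_cost(x, bounds):
    edges = [0, *bounds, len(x)]
    return sum(seg_cost(x, a, b) for a, b in zip(edges, edges[1:]))

def greedy_split(x, n_segments):
    """Repeatedly add the single boundary that gives the lowest total cost."""
    bounds = []
    while len(bounds) < n_segments - 1:
        candidates = [c for c in range(1, len(x)) if c not in bounds]
        best = min(candidates, key=lambda c: total_cost(x, sorted(bounds + [c])))
        bounds = sorted(bounds + [best])
    return bounds

def dp_split(x, n_segments):
    """Exact minimization over all segmentations by dynamic programming."""
    n = len(x)
    cost = np.full((n_segments + 1, n + 1), np.inf)
    back = np.zeros((n_segments + 1, n + 1), dtype=int)
    cost[0, 0] = 0.0
    for k in range(1, n_segments + 1):        # number of segments used so far
        for j in range(k, n + 1):             # end of the k-th segment
            for i in range(k - 1, j):         # start of the k-th segment
                c = cost[k - 1, i] + seg_cost(x, i, j)
                if c < cost[k, j]:
                    cost[k, j], back[k, j] = c, i
    bounds, j = [], n
    for k in range(n_segments, 1, -1):        # walk back through segment starts
        j = back[k, j]
        bounds.append(int(j))
    return sorted(bounds)
```

The dynamic program evaluates on the order of K n^2 candidate segments for n elements and K segments, which is why it is markedly slower than the greedy pass on long inputs.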

Overall, our method outperforms all previous untrained
methods. As commented earlier (toward the end
of subsection 4.2.2), we have included the results of the topic
modeling based approaches M00 [15], R12, and D13
for reference. But due to the repeated appearance of the same
sentences throughout each section of the Choi dataset, methods
that split that dataset into test and training sets have
unavoidable access to the entirety of the test set during training,
albeit in a different order (see footnote). These results can therefore
only be compared to other algorithms permitted to make
extensive use of the test data during cross-validation training.
Only the TT, C99, U00 and raw CVS methods can be
considered as completely untrained. The C01 method derives
its LSA vectors from the Brown corpus, from which
the Choi test set is constructed, but that provides only a
weak benefit, and the F04 method is additionally trained on
a subset of the test set to achieve its best performance, but
its use only of idf values provides a similarly weak benefit.

Alg          3-5     6-8     9-11    3-11
TT           44      43      48      46
C99          12      9       9       12
C01 [3]      10      7       5       9
U00          9       7       5       10
F04 [10]     5.5     3.0     1.3     7.0
G-CVS        5.14    4.82    6.38    6.49
R-CVS        3.92    3.75    5.17    5.65
DP-CVS       3.41    3.45    4.45    5.29
M00 [15]     2.2     2.3     4.1     2.3
R12          1.24    0.76    0.56    0.95
D13          1.0     0.9     1.2     0.6

Table 3: Some published Pk results on the Choi
dataset against our raw CVS method. G-CVS uses
a greedy splitting strategy, R-CVS uses up to 20 iterations
to refine the results of the greedy strategy,
and DP-CVS shows the optimal results obtained by
dynamic programming. We include the topic modeling
results M00, R12, and D13 for reference, but
for reasons detailed in the text do not regard them
as comparable, due to their mingling of test and
training samples.

We emphasize that the raw CVS method is completely
independent of the Choi test set, using word vectors derived
from a completely different corpus. In Fig. 2 we reproduce
the relevant results from the last column of Table 3 to
highlight the performance benefits provided by the semantic
word embedding.

Note also the surprising performance of the refined splitting
strategy, with the R-CVS results in Table 3 much lower
than the greedy G-CVS results, and moving close to the optimal
DP-CVS results, at far lower computational cost. In
particular, taking the dynamic programming segmentation
as the true segmentation, we can assess the performance of
the refined strategy. As seen in Table 4, the refined segmentation
very closely approximates the optimal segmentation.

This is important in practice since the dynamic programming
segmentation is much slower, taking five times longer
to compute on the 3-11 subset of the Choi test set. The
dynamic programming segmentation becomes computationally
infeasible at the scale of word level segmentation
on the arXiv dataset considered in the next section, whereas
the refined segmentation method remains eminently feasible.

Footnote: In [18] it is observed that “This makes the Choi data set
artificially easy for supervised approaches.” See appendix B.

Figure 2: Results from last column of Table 3 reproduced
to highlight the performance of the CVS segmentation
algorithm compared to similar untrained
algorithms. Its superior performance in an unsupervised
setting suggests applications on documents “in
the wild”.

                     3-5     6-8     9-11    3-11
R-CVS vs DP-CVS      0.90    0.65    1.16    1.15

Table 4: Treating the dynamic programming splits
as the true answer, the error of the refined splits as
measured in Pk across the subsets of the Choi test
set.
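
One plausible form of the iterative refinement discussed above is sketched below. It is an assumption of this example, not a transcription of the procedure used for R-CVS: each boundary in turn is moved to the position between its neighbors that gives the lowest total cost, and sweeps repeat until nothing moves or an iteration cap (20 here, matching the cap quoted for the experiments) is reached. The Euclidean cost stands in for whichever score function is being minimized.

```python
import numpy as np

def total_cost(x, bounds):
    """Sum over segments of squared deviations from the segment mean."""
    edges = [0, *bounds, len(x)]
    return sum(float(((x[a:b] - x[a:b].mean(axis=0)) ** 2).sum())
               for a, b in zip(edges, edges[1:]))

def refine(x, bounds, max_iter=20):
    """Move one boundary at a time to its best position; stop when none moves."""
    bounds = sorted(bounds)
    for _ in range(max_iter):
        changed = False
        for idx in range(len(bounds)):
            # candidate positions lie strictly between the neighboring boundaries
            lo = bounds[idx - 1] + 1 if idx > 0 else 1
            hi = bounds[idx + 1] if idx + 1 < len(bounds) else len(x)
            best = min(range(lo, hi),
                       key=lambda c: total_cost(x, bounds[:idx] + [c] + bounds[idx + 1:]))
            if best != bounds[idx]:
                bounds[idx], changed = best, True
        if not changed:
            break  # converged: a full sweep left every boundary in place
    return bounds
```

Each sweep only rescans the gaps between adjacent boundaries, so its cost grows roughly linearly with the number of elements, in contrast to the quadratic search of the dynamic program.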

4.3 ArXiv Dataset 

Performance evaluation on the Choi test set implements
segmentation at the sentence level, i.e., with segments
composed of sentences as the basic elements. But text sources
do not necessarily have well-marked sentence boundaries.
The arXiv is a repository of scientific articles which for practical
reasons extracts text from PDF documents (typically
using pdfminer/pdf2txt.py). That Postscript-based format
was originally intended only as a means of formatting
text on a page, rather than as a network transmission format
encoding syntactic or semantic information. The result
is often somewhat corrupted, either due to the handling of
mathematical notation, the presence of footers and headers,
or even just font encoding issues.

To test the segmentation algorithms in a realistic setting,
we created a test set similar to the Choi test set, but
based on text extracted from PDFs retrieved from the arXiv
database. Each test document is composed of a random
number of contiguous words, uniformly chosen between 100
and 300, sampled at random from the text obtained from
arXiv articles. The text was preprocessed by lowercasing
and inserting spaces around every non-alphanumeric character,
then splitting on whitespace to tokenize. An example
of two of the segments of the first test document is shown
in Figure 3 below.
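
The preprocessing and sampling just described can be sketched as follows. The function names, the inclusive treatment of the 100 to 300 word range, and the `n_segments` parameter (the number of segments per test document is not restated here) are assumptions of the example.

```python
import random

def tokenize(text):
    """Lowercase, pad every non-alphanumeric character with spaces, split on whitespace."""
    padded = "".join(ch if ch.isalnum() or ch.isspace() else f" {ch} "
                     for ch in text.lower())
    return padded.split()

def sample_segment(tokens, rng, lo=100, hi=300):
    """A contiguous run of tokens whose length is drawn uniformly from [lo, hi]."""
    length = min(rng.randint(lo, hi), len(tokens))
    start = rng.randrange(len(tokens) - length + 1)
    return tokens[start:start + length]

def build_document(articles, n_segments, rng):
    """Concatenate segments from randomly chosen articles; return tokens and true boundaries."""
    tokens, bounds = [], []
    for _ in range(n_segments):
        if tokens:
            bounds.append(len(tokens))  # index of the first token of the new segment
        tokens.extend(sample_segment(rng.choice(articles), rng))
    return tokens, bounds
```

For instance, `tokenize("Hamilton's rule, 2010.")` yields the tokens `hamilton`, `'`, `s`, `rule`, `,`, `2010`, `.`, in line with the style of the sample text in Figure 3.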

This is a much more difficult segmentation task: due to 
the presence of numbers and many periods in references, 
there are no clear sentence boundaries on which to initially 
group the text, and no natural boundaries are suggested in 
the test set examples. Here segmentation algorithms must 
work directly at the “word” level, where word can mean a 
punctuation mark. The presence of garbled mathematical 
formulae adds to the difficulty of making sense of certain 
streams of text. 

In Table 5 we summarize the results of three word vector
. nature_414 : 441 - 443 . 12 seinen , i . and schram
a . 2006 . social_status and group norms : 

indirect_reciprocity in a helping experiment . 
european_economic_review 50 : 581 - 602 . silva , 

e . r . , jaffe , k . 2002 . expanded food 

choice as a possible factor in the evolution of 
eusociality in vespidae sociobiology 39 : 25 - 36 
. smith , j . , van dyken , j . d . , zeejune , 
p . c . 2010 . a generalization of hamilton ’ s 
rule for the evolution of microbial cooperation 
science_328 , 1700 - 1703 . zhang , j . , wang , 
j . , sun , s . , wang , 1 . , wang , z . , xia , 
c . 2012 . effect of growing size of interaction 

neighbors on the evolution of cooperation in 
spatial snowdrift_game . chinese_science bulletin 
57 : 724 - 728 . Zimmerman , m . , egu ‘i luz , 

V . , san_miguel , 

of ) e , equipped_with the topology of 

weak_convergence . we will state some results 
about random measures . 10 definition a . 1 ( 

first two moment measures ) . for a 

random_variable z , taking values in p ( e ) , 

and k = 1 , 2 , . . . , there is a 

uniquely_determined measure /j,(k) onb(ek) 
such that e [ z ( al ) z(ak)] = /i.(k) 

( al X X ak ) for al , . . . , ak G b ( e ) 

. this is called the kth_moment measure . 
equivalently , /x ( k ) is the unique measure such 
that e [ hz , 0 li hz ,^ki] = h/i(k) 

J (f> 1 0 ki , where h . , . i denotes 

integration . lemma a . 2 ( characterisation of 
deterministic random measures ) . let z be a 

random_variable_taking values in p ( e ) with the 

first two moment measures /x : = /x ( 1 ) and /x ( 

2 ) . then the 

following_assertions_are_equivalent : 1 . there 

is Z/* G p ( e ) with z = u , almost_surely . 2 . 
the second_moment measure has product - form , i 
. e ./x(2)=/x0/x( which is equivalent to e 
[hz , <p 1± ■ hz , (p 2± 1 = h ^ , <p 1± ■ h fi , (p 

2i ( this is in fact equivalent to e [ hz , ^ i2 

] 




Figure 3: Example of two of the segments from a 
document in the arXiv test set. 


powered approaches, comparing a C99 style algorithm to 
our content vector based methods, both for unnormalized 
and normalized word vectors. Since much of the language 
of the scientific articles is specialized, the word vectors used 
in this case were obtained from GloVe trained on a corpus 
of similarly preprocessed texts from 98,392 arXiv articles. 
(Since the elements are now words rather than sentences, 
the only issue involves whether or not those word vectors 
are normalized.) As mentioned, the dynamic programming 
approach is prohibitively expensive for this dataset. 

We see that the CVS method performs far better on the
test set than the C99 style segmentation using word vectors.
The Pk and WD values obtained are not as impressive as
those obtained on the Choi test set, but this test set offers
a much more challenging segmentation task: it requires the
methods to work at the level of words, and also includes
the possibility that natural topic boundaries occur within the
test set segments themselves. The segmentations obtained with
the CVS method typically appear sensibly split on section
boundaries, references and similar formatting boundaries,
not known in advance to the algorithm.

As a final illustration of the effectiveness of our algorithm
at segmenting scientific articles, we’ve applied the best performing
algorithm to this article. Fig. 4 shows how the algorithm
segments the text roughly along section borders.





Figure 4: Effect of applying our segmentation algorithm
to this paper with 40 segments. The segments
are denoted with alternating color overlays.
Alg      S    Pk       WD
oC99     G    47.26    47.26
oC99     R    47.06    49.16
CVS      G    26.07    28.23
CVS      R    25.55    27.73
CVSn     G    24.63    26.69
CVSn     R    24.03    26.15

Table 5: Results on the arXiv test set for the
C99 method using word vectors (oC99), our CVS
method, and CVS method with normalized word
vectors (CVSn). The Pk and WD metrics are given
for both the greedy (G) and refined (R) splitting
strategies, with respect to the reference segmentation
in the test set. The refined strategy was allowed up
to 20 iterations to converge. The refinement converged
for all of the CVS runs, but failed to converge
for some documents in the test set under the C99
method. Refinement improved performance in all
cases, and our CVS methods improve significantly
over the C99 method for this task.
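
For reference, the WD (WindowDiff) score reported alongside Pk can be sketched as below, following its usual definition: a window of width k slides over the text, and a position counts as an error whenever the reference and hypothesis contain different numbers of boundaries inside the window. The function names and the boundary-index representation are assumptions of the example, and k must satisfy 1 <= k < n.

```python
import numpy as np

def boundary_counts(bounds, n, k):
    """Number of boundaries between elements i and i + k, for every valid i."""
    marks = np.zeros(n, dtype=int)
    marks[np.asarray(list(bounds), dtype=int)] = 1  # marks[b] = 1: element b starts a segment
    cum = np.cumsum(marks)
    return cum[k:] - cum[:-k]

def window_diff(ref_bounds, hyp_bounds, n, k):
    """Fraction of windows in which the two segmentations disagree on the boundary count."""
    ref = boundary_counts(ref_bounds, n, k)
    hyp = boundary_counts(hyp_bounds, n, k)
    return float(np.mean(ref != hyp))
```

Unlike Pk, this score also penalizes a hypothesis that places extra boundaries close to a true one, which is one reason the two columns of Table 5 need not move together.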

5. CONCLUSION 

We have presented a general framework for describing and
developing segmentation algorithms, and compared some existing
and new strategies for representation, scoring and
splitting. We have demonstrated the utility of semantic
word embeddings for segmentation, both in existing algorithms
and in new segmentation algorithms. On a real-world
segmentation task at word level, we’ve demonstrated the
ability to generate useful segmentations of scientific articles.
In future work, we plan to use this segmentation technique
to facilitate retrieval of documents with segments of concentrated
content, and to identify documents with localized
sections of similar content.
APPENDIX

A. DETAILS OF C99 REPRESENTATION 

This set of experiments compares to the results reported in
[4]. We implemented our own version of the C99 algorithm
(oC99) and tested it on the Choi dataset. We explored the
effect of various changes to the representation part of the
algorithm, namely the effects of removing stop words, cutting
small sentence sizes, stemming the words, and performing
the rank transformation on the cosine similarity matrix.
For stemming, the implementation of the Porter stemming
algorithm from nltk was used. For stopwords, we used the
list distributed with the C99 code augmented by a list of
punctuation marks. The results are summarized in Table 6.
While we reproduce the results reported in [4] without
the rank transformation (C99 in Table 6), our results with
the rank transformation (last two lines for oC99) show
better performance without stemming. This is likely due to
particulars of the text transformations,
such as the precise stemming algorithm and the stopword
list. We attempted to match the choices made in [4] as
much as possible, but still observed some deviations.
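
The rank transformation mentioned above can be sketched as follows, under the common description of the C99 ranking step: each entry of the similarity matrix is replaced by the fraction of its neighbors, within a square window of the given kernel size, that have a strictly lower value. The normalization by the number of neighbors actually examined (which handles the matrix edges) and the function name are assumptions of the example.

```python
import numpy as np

def rank_transform(sim, kernel=11):
    """Replace each similarity by the fraction of lower-valued neighbors in a kernel x kernel window."""
    sim = np.asarray(sim, dtype=float)
    n, half = len(sim), kernel // 2
    ranked = np.zeros_like(sim)
    for i in range(n):
        for j in range(n):
            # window is clipped at the matrix edges
            window = sim[max(0, i - half):i + half + 1,
                         max(0, j - half):j + half + 1]
            others = window.size - 1  # neighbors examined, excluding the cell itself
            ranked[i, j] = (window < sim[i, j]).sum() / others if others else 0.0
    return ranked
```

Because only the local ordering of similarities survives, the transformed matrix is insensitive to the absolute scale of the cosine values, which is the motivation usually given for this step.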

Perhaps the most telling deviation is the 1.5% swing in 
results for the last two rows, whose only difference was a 
change in the tie breaking behavior of the algorithm. In our 
best result, we minimized the objective at each stage, so in 
the case of ties would break at the earlier place in the text, 
whereas for the TBR row, we maximized the negative of the 
objective, so in the case of ties would break on the rightmost 
equal value. 
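
The two tie-breaking behaviors can be illustrated with a toy example. The helper names are hypothetical, and the sketch only shows how the choice among equal scores can differ; it does not reproduce the search code of the experiments.

```python
def leftmost_min(scores):
    """Index of the smallest score; ties resolve to the earliest position."""
    return min(range(len(scores)), key=lambda i: scores[i])

def rightmost_min(scores):
    """Index of the smallest score; ties resolve to the latest position."""
    return max(range(len(scores)), key=lambda i: (-scores[i], i))

scores = [3.0, 1.0, 2.0, 1.0]
print(leftmost_min(scores), rightmost_min(scores))  # prints: 1 3
```

When many candidate cuts share the same score, as can happen with rank-transformed integer-like values, the two rules select different boundaries and the difference can propagate through later splits.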

These relatively large swings in the performance on the
Choi dataset suggest that it is most appropriate to compare
differences in parameter settings for a particular implementation
of an algorithm. Comparing results between
different articles to assess performance improvements due to
algorithmic changes hence requires careful attention to the
implementation details.

B. OVERFITTING THE CHOI DATASET 

Recall from sec. 4.2 that each sample document in the
Choi dataset is composed of 10 segments, and each such
segment is the first n sentences from one of a 124-document
subset of the Brown corpus (the ca**.pos and cj**.pos
sets). This means that each of the four Choi test sets (n =
3-5, 6-8, 9-11, 3-11) necessarily contains multiple repetitions
of each sentence. In the 3-5 Choi set, for example, there are
3986 sentences, but only 608 unique sentences, so that each
sentence appears on average 6.6 times. In the 3-11 set, with
400 sample documents, there are 28,145 sentences, but only
1353 unique sentences, for an average of 20.8 appearances
for each sentence. Furthermore, in all cases there are only
124 unique sentences that can begin a new segment. This
redundancy means that a trained method such as LDA will
see most or all of the test data during training, and can
easily overfit to the observed segmentation boundaries, especially
when the number of topics is not much smaller than
the number of documents. For example, using standard 10-fold
cross validation on an algorithm that simply identifies
a segment boundary for any sentence in the test set that began
a document in the training set gives better than 99.9%
accuracy in segmenting all four parts of the Choi dataset.
For this reason, we have not compared to the topic-modeling
based segmentation results listed in our tables.
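
The redundancy argument can be made concrete with a small sketch. The data layout (a document as a list of segments, each a list of sentence strings) and the function names are assumptions of the example, and the baseline is interpreted here as memorizing every sentence that opens a segment in the training documents.

```python
def repetition(documents):
    """Total sentences, unique sentences, and mean appearances per unique sentence."""
    sentences = [s for doc in documents for seg in doc for s in seg]
    unique = set(sentences)
    return len(sentences), len(unique), len(sentences) / len(unique)

def memorized_starts(train_documents):
    """Every sentence that opens a segment in the training documents."""
    return {seg[0] for doc in train_documents for seg in doc}

def predict_boundaries(sentences, starts):
    """Place a boundary before any sentence seen opening a segment during training."""
    return [i for i, s in enumerate(sentences) if i > 0 and s in starts]

train = [[["a1", "a2"], ["b1", "b2", "b3"]],
         [["c1", "c2"], ["a1", "a2", "a3"]]]
test_doc = ["b1", "b2", "c1", "c2", "c3", "a1"]  # true boundaries at 2 and 5
print(predict_boundaries(test_doc, memorized_starts(train)))  # prints: [2, 5]
```

With the counts quoted above, the same ratio gives 3986 / 608, about 6.6 appearances per sentence in the 3-5 set, so a memorizing baseline of this kind has seen almost every segment-opening sentence before test time.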


note       cut   stop   stem   rank   Pk (%)   WD (%)
C99 [4]    0     T      F      0      23       -
           0     T      F      11     13       -
           0     T      T      11     12       -
oC99       0     T      F      0      22.52    22.52
           0     T      F      11     16.69    16.72
           0     T      T      11     17.90    19.96
Reps       0     F      F      0      32.26    32.28
           5     F      F      0      32.73    32.76
           0     T      F      0      22.52    22.52
           0     F      T      0      32.26    32.28
           0     T      T      0      23.33    23.33
           5     T      T      0      23.56    23.59
           5     T      T      3      18.17    18.30
           5     T      T      5      17.44    17.56
           5     T      T      7      16.95    17.05
           5     T      T      9      17.12    17.20
           5     T      T      11     17.07    17.14
           5     T      T      13     17.11    17.19
TBR        5     T      F      11     17.04    17.12
Best       5     T      F      11     15.56    15.64


Table 6: Effects of text representation on the performance
of the C99 algorithm. The cut column denotes
the cutoff for the length of a sentence after preprocessing.
The stop column denotes whether stop
words and punctuation are removed. The stem column
denotes whether the words are passed through
the Porter stemming algorithm. The rank column
denotes the size of the kernel for the ranking transformation.
Evaluations are given both as the Pk
metric and the Window Diff (WD) score. All experiments
are done on the 400 test documents in
the 3-11 set of the Choi dataset. The upper section
cites results contained in the CWM 2000 paper
[4]. The second section is an attempt to match these
results with our implementation (oC99). The third
section attempts to give an overview of the effect of
different parameter choices for the representation
step of the algorithm. The last section reports our
best observed result as well as a run (TBR) with
the same parameter settings, but with a tie-breaking
strategy that takes the right-most rather than the left-most
equal value.